Public: Technology Reviews : Storing Java jars and classes in git
This page last changed on Sep 10, 2008 by stepheneb.
Storing Java jars and classes in git
I want to solve two problems.
I've been using git a great deal recently and have been very impressed. I think we could get huge improvements by using Java Web Start only to install a minimal app, which would then use the pure-Java jgit library to fetch, via git, the rest of the resources that Web Start used to deliver. I also think we could get the same benefit by using git to transfer content.
FYI: the very best intro to git for a programmer is Scott Chacon's Rails Conf talk. He's made it available on his gitcasts web site as a screencast: http://www.gitcasts.com/posts/railsconf-git-talk
Tests with the versioned OTrunk jar
I wanted to see what the possibilities were if we used git both to create a repository for our Java code and to move differences with its transport mechanism. I grabbed copies of all 226 versions of the otrunk jars (from: V0.1.0-20070419.221212-115 to: V0.1.0-20080717.124415-532) here: http://jnlp.concord.org/dev/org/concord/otrunk
Altogether these jars added up to 83MB. I then ran two experiments with two variations.
Initial conditions:
Experiment 1, jars.
Experiment 2, expanded jars.
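Both experiments depend on recovering a plain jar name and a version string from each versioned jar file. A minimal sketch of that split follows, assuming file names of the form <name>__<version>.jar (the form matched by the pattern in the test script further down this page). The method name split_versioned_jar and the sample path are hypothetical, chosen only for illustration.

```ruby
# Split a versioned jar file name of the assumed form <name>__<version>.jar
# into the plain jar name and the version string.
# Returns nil when the path does not match that form.
def split_versioned_jar(path)
  match = File.basename(path).match(/\A(.+)__(.+)\.jar\z/)
  return nil unless match
  ["#{match[1]}.jar", match[2]]
end

# Hypothetical sample path; the version string is the oldest one listed above.
p split_versioned_jar('otrunk-jars/otrunk__V0.1.0-20070419.221212-115.jar')
# => ["otrunk.jar", "V0.1.0-20070419.221212-115"]
```

Anchoring the pattern and escaping the dot before "jar" means a file that does not follow the naming scheme yields nil instead of a partial match.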
The raw differences are also tiny. Here's the size in kB of the output generated by git diff --raw, comparing the HEAD of the master branch with both the most recent and the oldest commits.

    puts sprintf('%.1f', `git diff master~1 --raw | wc -c`[/\s*(\d+)/, 1].to_f / 1024)
    # => 0.4
    puts sprintf('%.1f', `git diff master~225 --raw | wc -c`[/\s*(\d+)/, 1].to_f / 1024)
    # => 24.0

Using jardiff to generate a difference file for the two most recent revisions created a file that was about 24k. That means git produced a difference file that was about 50 times smaller than the equivalent difference file generated by the Java Web Start servlet with jardiff.
Jardiff works by sending only the classes that have changed. In these tests git goes one very big step further by sending only the differences within the classes that have changed.
I don't have accurate numbers for the difference produced by jardiff for the oldest revision because it was larger than the 122k size of the pack.gz version of the jar file.
I also found out you can easily use git to generate a stream containing any revision of a file, which can then be piped to a file or a socket.
Here's the Ruby file I used to generate the tests:
file: otrunk_jars.rb

    #!/usr/bin/env ruby
    require 'rubygems'
    require 'git'
    require 'fileutils'
    include FileUtils::Verbose

    # Split a path like ".../<name>__<version>.jar" into ["<name>.jar", "<version>"].
    def jar_name_and_version(path)
      path[/.*\/(.*)__(.*).jar/, 1]
      ["#{$1}.jar", $2]
    end

    # Experiment 2: commit the expanded contents of each jar.
    rm_rf('otrunk-classes-git')
    mkdir('otrunk-classes-git')
    cd('otrunk-classes-git') do
      git = Git.init
      otrunk_jars = Dir.glob('../otrunk-jars/*')
      otrunk_jars.each do |jar|
        name, version = jar_name_and_version(jar)
        # Clear the working directory (everything except .git) before unzipping.
        rm_rf(Dir.entries('.')-%w{. .. .git})
        cp(jar, name)
        `unzip #{name}`
        rm(name)
        git.add('*')
        git.commit_all("adding all the content from the unzipped: #{name}, version: #{version}")
        git.add_tag(version)
      end
      puts "uncompressed size of git dir: \n#{`du -chd0 .git`}"
      puts "running: 'git gc':\n#{`git gc`}"
      puts "compressed size of git dir: \n#{`du -chd0 .git`}"
      puts
    end

    # Experiment 1: commit each jar file as-is.
    rm_rf('otrunk-jars-git')
    mkdir('otrunk-jars-git')
    cd('otrunk-jars-git') do
      git = Git.init
      otrunk_jars = Dir.glob('../otrunk-jars/*')
      otrunk_jars.each do |jar|
        name, version = jar_name_and_version(jar)
        cp(jar, name)
        git.add(name)
        git.commit("adding #{name}, version: #{version}")
        git.add_tag(version)
      end
      puts "uncompressed size of git dir: \n#{`du -chd0 .git`}"
      puts "running: 'git gc':\n#{`git gc`}"
      puts "compressed size of git dir: \n#{`du -chd0 .git`}"
      puts
    end

Tests with the resources used by the all-otrunk-snapshot jnlps
I decided to truly abuse git and created a git repository that contains all the resources referenced by ALL 634 of the versioned all-otrunk-snapshot.jnlps located here: http://jnlp.concord.org/dev/org/concord/maven-jnlp/all-otrunk-snapshot/
On the cc jnlp server in this dir: /home/sbannasch/src I've got these two ruby scripts:
The all-otrunk-classes-git repository contains all the content in all the jars and native libraries referenced by all 634 revisions of all-otrunk-snapshot.jnlps for the last 15 months.
The contents of all the jars and native libraries were unzipped before committing to the git repository. Git is much more efficient at generating diffs and compressing content when it is working with a large collection of smaller files than with a smaller collection of jar files into which the original content has been packed.
Each maven jnlp version of the resource set was committed and tagged. In addition, a local branch was made referencing this commit.
The repository is 111 MB (size of the .git dir). The working directory is an additional 226 MB. The content in the working directory is created when a branch or tag is checked out of the repository (from content stored in the '.git' directory). If you were to clone this git repository, the data transferred would be about 111 MB.
When the all-otrunk jnlp is run from Java Web Start, the jar and nativelib resources take just about 70MB in the web start cache.
Here are some measurements of repository performance, starting with tag 0.1.0-20080718.202956 checked out in the working directory.
First I created local branches for all of the tags I'm testing. Each local branch has a name of this form: local_<tag>
So the commit object in the repository associated with tag 0.1.0-20080718.174145 is also referenced by a local branch named local_0.1.0-20080718.174145.
About the data collected:
The diff value is the size of what would be transferred over the network if you were updating from the older tag to the most recent tag.
The 'time to calculate diff' is based on generating the diff between the most recent tag, 0.1.0-20080718.202956, and the selected tag.
The checkout time is the time taken on troy to check out the local branch for that tag, starting from the initial condition of having the local branch (master) for the most recent tag (0.1.0-20080718.202956) checked out.
Previous tags: 1, 2, 3, 4, 5
Previous tags: 10, 20, 30, 40, 50
Previous tags: 100, 200, 300, 400, 500
Let the data in that table sink in a bit ...
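The per-tag measurements described above (diff size, time to calculate the diff, and checkout time) could be collected with something like the following. This is only a sketch under stated assumptions, not the script that produced the numbers: the method names, the SHELL runner lambda, and the use of the byte count of git diff --raw output as the "diff size" are all assumptions. The local_<tag> branch naming and the tag names come from the text.

```ruby
require 'benchmark'

# Most recent tag, taken from the text above.
HEAD_TAG = '0.1.0-20080718.202956'

# Local branch naming convention described in the text: local_<tag>.
def local_branch(tag)
  "local_#{tag}"
end

# Convert a byte count to kilobytes, rounded to one decimal place.
def kilobytes(bytes)
  (bytes / 1024.0).round(1)
end

# Default command runner: executes a shell command and returns its stdout.
# Assumption: a hypothetical seam so the git calls can be stubbed out.
SHELL = ->(cmd) { `#{cmd}` }

# Measure one older tag against HEAD_TAG.
# Returns the diff size in kB, the seconds spent generating the diff,
# and the seconds spent checking out the older tag's local branch.
def measure(tag, run: SHELL)
  diff = nil
  diff_seconds = Benchmark.realtime do
    diff = run.call("git diff --raw #{tag} #{HEAD_TAG}")
  end
  checkout_seconds = Benchmark.realtime do
    run.call("git checkout -q #{local_branch(tag)}")
  end
  # Restore the initial condition (master at the most recent tag)
  # so every measurement starts from the same state.
  run.call('git checkout -q master')
  {
    tag: tag,
    diff_kb: kilobytes(diff.bytesize),
    diff_seconds: diff_seconds,
    checkout_seconds: checkout_seconds
  }
end
```

Run from inside the repository's working directory, a call such as measure('0.1.0-20080718.174145') would return one row of data for that tag. Passing a different lambda as run: lets the timing logic be exercised without touching a real repository.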